177 research outputs found

    An Efficient Algorithm for Mining Frequent Sequence with Constraint Programming

    The main advantage of Constraint Programming (CP) approaches for sequential pattern mining (SPM) is their modularity, which includes the ability to add new constraints (regular expressions, length restrictions, etc.). The current best CP approach for SPM uses a global constraint (module) that computes the projected database and enforces the minimum frequency; it does this with a filtering algorithm similar to the PrefixSpan method. However, the resulting system is not as scalable as some of the most advanced mining systems, such as Zaki's cSPADE. We show how, using techniques from both data mining and CP, one can use a generic constraint solver and yet outperform existing specialized systems. This is mainly due to two improvements in the module that computes the projected frequencies. First, computing the projected database is sped up by pre-computing the positions at which a symbol can become unsupported by a sequence, which avoids scanning the full sequence each time. Second, a backtracking-aware data structure, inspired by the trailing used in CP solvers, allows fast incremental storing and restoring of the projected database. Detailed experiments show that this approach outperforms existing CP as well as specialized systems for SPM, and that the efficiency gain carries over directly to other settings such as mining with regular expressions. Comment: frequent sequence mining, constraint programmin
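    The first improvement described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`last_positions`, `extension_supports`, `project`) and the representation of a projected database as `(sequence id, start index)` pairs are assumptions made for illustration, and the trail-based backtracking structure is not shown.

```python
from collections import defaultdict

# Hypothetical helper names; a sketch of the idea, not the paper's code.

def last_positions(seq):
    """Return (symbol, last index) pairs, sorted by decreasing last index."""
    last = {}
    for i, sym in enumerate(seq):
        last[sym] = i  # later occurrences overwrite earlier ones
    return sorted(last.items(), key=lambda kv: -kv[1])

def extension_supports(db_last, projected):
    """Count, for each symbol, the sequences that still support it.

    projected: list of (seq_id, start), where start is the first index
    after the embedding of the current prefix in that sequence.
    """
    support = defaultdict(int)
    for sid, start in projected:
        for sym, pos in db_last[sid]:
            if pos < start:
                break  # all remaining symbols end even earlier
            support[sym] += 1
    return dict(support)

def project(db, projected, sym):
    """Extend the prefix by sym and return the new projected database."""
    out = []
    for sid, start in projected:
        try:
            idx = db[sid].index(sym, start)  # earliest match at or after start
        except ValueError:
            continue  # this sequence no longer supports the prefix
        out.append((sid, idx + 1))
    return out

# Toy database (made up for this example).
db = [list("abcb"), list("bac"), list("cab")]
db_last = [last_positions(seq) for seq in db]
root = [(sid, 0) for sid in range(len(db))]
after_a = project(db, root, "a")
supports_after_a = extension_supports(db_last, after_a)
```

    Because each list is sorted by decreasing last position, the inner loop stops at the first symbol that can no longer occur, so the suffix itself is never scanned when counting supports.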

    Constraint-based sequence mining using constraint programming

    The goal of constraint-based sequence mining is to find sequences of symbols that are included in a large number of input sequences and that satisfy some constraints specified by the user. Many constraints have been proposed in the literature, but a general framework is still missing. We investigate the use of constraint programming as a general framework for this task. We first identify four categories of constraints that are applicable to sequence mining. We then propose two constraint programming formulations. The first formulation introduces a new global constraint called exists-embedding. This formulation is the most efficient but does not support one type of constraint. To support such constraints, we develop a second formulation that is more general but incurs more overhead. Both formulations can use the projected database technique used in specialised algorithms. Experiments demonstrate the flexibility towards constraint-based settings and compare the approach to existing methods. Comment: In Integration of AI and OR Techniques in Constraint Programming (CPAIOR), 201
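    The notion of a pattern being "included in" an input sequence can be made concrete with a small sketch. It only checks whether an embedding exists and counts how many sequences contain one; it is not the exists-embedding global constraint from the abstract. The names `is_embedded`, `frequency` and `satisfies`, and the choice of a maximum-length constraint as the user constraint, are assumptions for illustration.

```python
# Hypothetical helper names; a plain check, not a CP propagator.

def is_embedded(pattern, sequence):
    """True if pattern occurs in sequence as a (possibly gapped) subsequence."""
    it = iter(sequence)
    # `sym in it` consumes the iterator up to the first match,
    # so symbols must be found in order.
    return all(sym in it for sym in pattern)

def frequency(pattern, db):
    """Number of input sequences that contain an embedding of pattern."""
    return sum(is_embedded(pattern, seq) for seq in db)

def satisfies(pattern, db, min_freq, max_len):
    """Minimum-frequency constraint plus one example user constraint (length)."""
    return len(pattern) <= max_len and frequency(pattern, db) >= min_freq

# Toy database (made up for this example).
db = ["abcb", "bac", "cab"]
```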

    Flexible constrained sampling with guarantees for pattern mining

    Pattern sampling has been proposed as a potential solution to the infamous pattern explosion. Instead of enumerating all patterns that satisfy the constraints, individual patterns are sampled proportional to a given quality measure. Several sampling algorithms have been proposed, but each of them has its limitations when it comes to 1) flexibility in terms of quality measures and constraints that can be used, and/or 2) guarantees with respect to sampling accuracy. We therefore present Flexics, the first flexible pattern sampler that supports a broad class of quality measures and constraints, while providing strong guarantees regarding sampling accuracy. To achieve this, we leverage the perspective on pattern mining as a constraint satisfaction problem and build upon the latest advances in sampling solutions in SAT as well as existing pattern mining algorithms. Furthermore, the proposed algorithm is applicable to a variety of pattern languages, which allows us to introduce and tackle the novel task of sampling sets of patterns. We introduce and empirically evaluate two variants of Flexics: 1) a generic variant that addresses the well-known itemset sampling task and the novel pattern set sampling task as well as a wide range of expressive constraints within these tasks, and 2) a specialized variant that exploits existing frequent itemset techniques to achieve substantial speed-ups. Experiments show that Flexics is both accurate and efficient, making it a useful tool for pattern-based data exploration. Comment: Accepted for publication in Data Mining & Knowledge Discovery journal (ECML/PKDD 2017 journal track

    Constraint Programming for Multi-criteria Conceptual Clustering

    A conceptual clustering is a set of formal concepts (i.e., closed itemsets) that defines a partition of a set of transactions. Finding a conceptual clustering is an NP-complete problem for which Constraint Programming (CP) and Integer Linear Programming (ILP) approaches have been recently proposed. We introduce new CP models to solve this problem: a pure CP model that uses set constraints, and a hybrid model that uses a data mining tool to extract formal concepts in a preprocessing step and then uses CP to select a subset of formal concepts that defines a partition. We compare our new models with recent CP and ILP approaches on classical machine learning instances. We also introduce a new set of instances coming from a real application case, which aims at extracting setting concepts from an Enterprise Resource Planning (ERP) software. We consider two classic criteria to optimize, i.e., the frequency and the size. We show that these criteria lead to extreme solutions with either very few small formal concepts or many large formal concepts, and that compromise clusterings may be obtained by computing the Pareto front of non-dominated clusterings.
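    The Pareto front mentioned at the end of this abstract can be sketched as a simple filter over candidate solutions. This is a generic illustration, not the paper's CP model: it assumes each candidate clustering has already been scored as a `(frequency, size)` pair, that both criteria are maximized, and the names `dominates` and `pareto_front` are hypothetical.

```python
# Hypothetical helper names; scores are assumed to be maximized.

def dominates(a, b):
    """a dominates b: at least as good on every criterion, strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the points that no other point dominates (quadratic-time filter)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Made-up (frequency, size) scores for five candidate clusterings.
candidates = [(5, 1), (4, 2), (3, 2), (2, 4), (1, 3)]
front = pareto_front(candidates)
```

    In this toy input the two extremes, (5, 1) and (2, 4), survive together with the compromise (4, 2), which mirrors the trade-off the abstract describes.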

    Prefix-Projection Global Constraint for Sequential Pattern Mining

    Sequential pattern mining under constraints is a challenging data mining task. Many efficient ad hoc methods have been developed for mining sequential patterns, but they all suffer from a lack of genericity. Recent works have investigated Constraint Programming (CP) methods, but they are still not effective because of their encoding. In this paper, we propose a global constraint based on the projected databases principle which remedies this drawback. Experiments show that our approach clearly outperforms CP approaches and competes well with ad hoc methods on large datasets.

    Competition and facilitation between the marine nitrogen-fixing cyanobacterium Cyanothece and its associated bacterial community

    N2-fixing cyanobacteria represent a major source of new nitrogen and carbon for marine microbial communities, but little is known about their ecological interactions with associated microbiota. In this study we investigated the interactions between the unicellular N2-fixing cyanobacterium Cyanothece sp. Miami BG043511 and its associated free-living chemotrophic bacteria at different concentrations of nitrate and dissolved organic carbon and different temperatures. High temperature strongly stimulated the growth of Cyanothece, but had less effect on the growth and community composition of the chemotrophic bacteria. Conversely, nitrate and carbon addition did not significantly increase the abundance of Cyanothece, but strongly affected the abundance and species composition of the associated chemotrophic bacteria. In nitrate-free medium the associated bacterial community was co-dominated by the putative diazotroph Mesorhizobium and the putative aerobic anoxygenic phototroph Erythrobacter and, after addition of organic carbon, also by the Flavobacterium Muricauda. Addition of nitrate shifted the composition toward co-dominance by Erythrobacter and the Gammaproteobacterium Marinobacter. Our results indicate that Cyanothece modified the species composition of its associated bacteria through a combination of competition and facilitation. Furthermore, within the bacterial community, niche differentiation appeared to play an important role, contributing to the coexistence of a variety of different functional groups. An important implication of these findings is that changes in nitrogen and carbon availability due to, e.g., eutrophication and climate change are likely to have a major impact on the species composition of the bacterial community associated with N2-fixing cyanobacteria.

    An index to quantify an individual's scientific research output that takes into account the effect of multiple coauthorship

    I propose the index \hbar ("hbar"), defined as the number of papers of an individual that have citation count larger than or equal to the \hbar of all coauthors of each paper, as a useful index to characterize the scientific output of a researcher that takes into account the effect of multiple coauthorship. The bar is higher for \hbar. Comment: A few minor changes from v1. To be published in Scientometric